Linux cgroup CPU Limit
1st: lkinoue.icon
2nd: kimitoboku.icon
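So what are Control Groups (cgroups)?
cgroups let users allocate resources such as CPU time, system memory, and network bandwidth, or combinations of those resources, among user-defined groups of tasks (processes) running on the system. https://access.redhat.com/documentation/ja-jp/red_hat_enterprise_linux/6/html/resource_management_guide/ch01
A Linux feature for controlling resource usage at the granularity of processes (grouped into cgroups)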
cgroups v1 and v2
v1: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/cgroups.html
Settings for each resource are exposed in the form of subsystems
cpu, memory, blkio (block devices), etc.
A special filesystem called cgroup, used for configuration, is mounted under /sys/fs (in most cases)
https://access.redhat.com/documentation/ja-jp/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/setting-cpu-limits-to-applications-using-cgroups-v1_setting-limits-for-applications
e.g. cpu: under /sys/fs/cgroup (commonly /sys/fs/cgroup/cpu)
v2: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
https://elixir.bootlin.com/linux/latest/source/kernel/cgroup
https://github.com/containerd/cgroups
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu
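As a rough illustration of the v1 interface files, the sketch below reads cpu.cfs_quota_us and cpu.cfs_period_us from a cgroup directory and turns them into a CPU count. The helper names (read_ll, cpu_limit, cpu_limit_from_dir) are made up for this note, and the directory is supplied by the caller because the mount point differs between systems.

```c
#include <stdio.h>

/* Hypothetical helper: read one signed integer from a cgroup interface file.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed. */
int read_ll(const char *path, long long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%lld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Hypothetical helper: number of CPUs' worth of time allowed per period.
 * In cgroup v1 a negative quota means "no limit"; report that as -1.0. */
double cpu_limit(long long quota_us, long long period_us)
{
    if (quota_us < 0 || period_us <= 0)
        return -1.0;
    return (double)quota_us / (double)period_us;
}

/* Read both files under dir (a v1 cpu cgroup directory chosen by the caller)
 * and return the limit, or -1.0 if it is unlimited or unreadable. */
double cpu_limit_from_dir(const char *dir)
{
    char path[4096];
    long long quota, period;

    snprintf(path, sizeof(path), "%s/cpu.cfs_quota_us", dir);
    if (read_ll(path, &quota) != 0)
        return -1.0;
    snprintf(path, sizeof(path), "%s/cpu.cfs_period_us", dir);
    if (read_ll(path, &period) != 0)
        return -1.0;
    return cpu_limit(quota, period);
}
```

For example, a quota of 50000 with a period of 100000 gives 0.5, i.e. half of one CPU.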
https://elixir.bootlin.com/linux/latest/source/kernel/sched/cpuacct.c#L281
Trace the code using cfs_quota_us as the starting point
https://github.com/torvalds/linux/blob/3d7cb6b04c3f3115719235cc6866b10326de34cd/kernel/sched/core.c#L10850
code:c
 #ifdef CONFIG_CFS_BANDWIDTH
     {
         .name = "cfs_quota_us",
         .read_s64 = cpu_cfs_quota_read_s64,
         .write_s64 = cpu_cfs_quota_write_s64,
     },
https://github.com/torvalds/linux/blob/3d7cb6b04c3f3115719235cc6866b10326de34cd/kernel/sched/core.c#L10569
code:c
 static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 {
     u64 quota, period, burst;
     period = ktime_to_ns(tg->cfs_bandwidth.period);
     burst = tg->cfs_bandwidth.burst;
     if (cfs_quota_us < 0)
         quota = RUNTIME_INF;
     else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
         quota = (u64)cfs_quota_us * NSEC_PER_USEC;
     else
         return -EINVAL;
     return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
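The conversion in tg_set_cfs_quota() can be tried out in user space. This is only a sketch: quota_us_to_ns is a made-up name, RUNTIME_INF is modelled as UINT64_MAX, and -1 stands in for the kernel's -EINVAL.

```c
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL
#define RUNTIME_INF   UINT64_MAX  /* assumption: stand-in for "no limit" */

/* Hypothetical user-space model of the microsecond-to-nanosecond step.
 * Returns 0 and stores the quota in ns, or -1 if the value would overflow. */
int quota_us_to_ns(long long quota_us, uint64_t *quota_ns)
{
    if (quota_us < 0) {
        /* any negative value (e.g. -1) means unlimited */
        *quota_ns = RUNTIME_INF;
        return 0;
    }
    if ((uint64_t)quota_us > UINT64_MAX / NSEC_PER_USEC)
        return -1;  /* multiplying by 1000 would not fit in 64 bits */
    *quota_ns = (uint64_t)quota_us * NSEC_PER_USEC;
    return 0;
}
```

Writing 50000 therefore ends up as 50,000,000 ns, and writing -1 removes the limit.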
https://github.com/torvalds/linux/blob/3d7cb6b04c3f3115719235cc6866b10326de34cd/kernel/sched/core.c#L10481
code:c
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota, u64 burst)
 {
     ...
     ret = __cfs_schedulable(tg, period, quota);
     if (ret)
         goto out_unlock;
     ...
 }
https://github.com/torvalds/linux/blob/3d7cb6b04c3f3115719235cc6866b10326de34cd/kernel/sched/core.c#L10744-L10763
code:c
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 {
     ...
     ret = walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data);
 }
https://github.com/torvalds/linux/blob/3d7cb6b04c3f3115719235cc6866b10326de34cd/kernel/sched/core.c#L10711-L10742
code:c
 static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
 {
     struct cfs_schedulable_data *d = data;
     struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
     s64 quota = 0, parent_quota = -1;
     if (!tg->parent) {
         quota = RUNTIME_INF;
     } else {
         struct cfs_bandwidth *parent_b = &tg->parent->cfs_bandwidth;
         quota = normalize_cfs_quota(tg, d);
         parent_quota = parent_b->hierarchical_quota;
         /*
          * Ensure max(child_quota) <= parent_quota. On cgroup2,
          * always take the min. On cgroup1, only inherit when no
          * limit is set:
          */
         if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
             quota = min(quota, parent_quota);
         } else {
             if (quota == RUNTIME_INF)
                 quota = parent_quota;
             else if (parent_quota != RUNTIME_INF && quota > parent_quota)
                 return -EINVAL;
         }
     }
     cfs_b->hierarchical_quota = quota;
     return 0;
 }
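The v1/v2 difference described in the comment above can be modelled in a few lines. This is a simplified sketch, not the kernel's logic verbatim: effective_quota and QUOTA_INF are made-up names, quotas are plain unsigned numbers in arbitrary units (the kernel compares normalized values), and -1 stands in for -EINVAL.

```c
#include <stdint.h>

#define QUOTA_INF UINT64_MAX  /* assumption: stand-in for RUNTIME_INF */

/*
 * Hypothetical model of the parent/child rule for a non-root group.
 * on_v2: nonzero for cgroup v2 (default hierarchy), zero for cgroup v1.
 * Returns 0 and stores the effective quota, or -1 where the kernel
 * would reject the setting.
 */
int effective_quota(int on_v2, uint64_t quota, uint64_t parent_quota,
                    uint64_t *out)
{
    if (on_v2) {
        /* v2: the child is silently clamped to the parent's limit */
        *out = quota < parent_quota ? quota : parent_quota;
        return 0;
    }
    if (quota == QUOTA_INF) {
        /* v1: a child without a limit inherits the parent's limit */
        *out = parent_quota;
        return 0;
    }
    if (parent_quota != QUOTA_INF && quota > parent_quota)
        return -1;  /* v1: a child may not exceed its parent */
    *out = quota;
    return 0;
}
```

With a parent limit of 50 and a child request of 80, the v2 branch yields 50, while the v1 branch reports an error.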
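dfl is short for "default": cgroup_subsys_on_dfl() is true when the subsystem is attached to the default (cgroup v2) hierarchy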
static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
This is set as a function pointer (.enqueue_task) in the scheduling class
static void __sched notrace __schedule(unsigned int sched_mode)
static struct task_struct * pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) (SMT)
static inline struct task_struct * __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
Picks the highest-priority task
Calls the per-scheduling-class implementation
struct task_struct * pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
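The call chain above relies on each scheduling class providing its own callbacks through function pointers. The following sketch shows that dispatch pattern in isolation. Every name ending in _sketch, the two queue variables, and the one-task-per-class simplification are assumptions for illustration and do not match the kernel's real structures.

```c
#include <stddef.h>

struct task {
    const char *name;
};

/* Hypothetical, much-simplified stand-in for a scheduling class */
struct sched_class_sketch {
    const char *name;
    struct task *(*pick_next_task)(void);  /* returns NULL if nothing is runnable */
};

/* At most one runnable task per class, to keep the sketch short */
struct task *rt_queue;
struct task *fair_queue;

struct task *pick_next_task_rt_sketch(void)   { return rt_queue; }
struct task *pick_next_task_fair_sketch(void) { return fair_queue; }

/* Classes listed from highest to lowest priority */
const struct sched_class_sketch classes[] = {
    { "rt",   pick_next_task_rt_sketch },
    { "fair", pick_next_task_fair_sketch },
};

/* Ask each class in priority order; the first non-NULL task wins. */
struct task *pick_next_task_sketch(void)
{
    size_t i;

    for (i = 0; i < sizeof(classes) / sizeof(classes[0]); i++) {
        struct task *p = classes[i].pick_next_task();

        if (p)
            return p;
    }
    return NULL;  /* the real kernel always has an idle task to fall back on */
}
```

A task queued in the "rt" class is returned ahead of one in the "fair" class, because the classes are consulted in priority order.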
https://github.com/torvalds/linux/blob/1612c382ffbdf1f673caec76502b1c00e6d35363/kernel/sched/core.c#L10836
code:c
 static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
 {
     ...
 }
https://lwn.net/Articles/167897/
https://github.com/torvalds/linux/blob/3d7cb6b04c3f3115719235cc6866b10326de34cd/kernel/sched/fair.c#L5357
https://github.com/torvalds/linux/blob/3d7cb6b04c3f3115719235cc6866b10326de34cd/kernel/sched/fair.c#L5434
hrtimer period_timer
https://qiita.com/nhiroki/items/2fa7bb048118145b00cd
comment
CFS Bandwidth Control
Restricting cpu.cfs_quota_us increases idle time
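Why a lower quota shows up as idle time can be seen with a toy model. The sketch assumes a single CPU, a single task that always wants to run, no burst, and nothing else runnable, so throttled time is idle time. The names bw_result and simulate_cfs_quota are made up for this note.

```c
#include <stdint.h>

struct bw_result {
    uint64_t ran_us;        /* time the group actually ran */
    uint64_t throttled_us;  /* time it wanted to run but was throttled */
};

/*
 * Hypothetical model of quota/period accounting for one CPU-bound task.
 * Each loop iteration is one period: the runtime is refilled to the quota,
 * the task runs until it is used up, then waits for the next refill.
 */
struct bw_result simulate_cfs_quota(uint64_t quota_us, uint64_t period_us,
                                    unsigned periods)
{
    struct bw_result r = { 0, 0 };
    unsigned i;

    for (i = 0; i < periods; i++) {
        /* on one CPU the task cannot use more than the period itself */
        uint64_t runtime = quota_us < period_us ? quota_us : period_us;

        r.ran_us += runtime;
        r.throttled_us += period_us - runtime;
    }
    return r;
}
```

With a quota of 50000 and a period of 100000, ten periods give 500000 us of run time and 500000 us of throttled time, so the CPU sits idle half of the time even though the task is runnable.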
https://iret.media/wp-content/uploads/2015/01/20150116_cgroup_005-2-1024x573.png
Source
In cgroup v1, tasks can be attached to a cgroup at any level of the hierarchy
https://gihyo.jp/assets/images/admin/serial/01/linux_containers/0037/thumb/TH800_001.png
Source
https://elixir.bootlin.com/linux/latest/source/kernel/sched/core.c#L10525
code:c
 static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota, u64 burst)
https://stackoverflow.com/questions/8306202/what-does-nr-stand-for-in-system-call-number-that-is-usually-used-as-suffix
nr is an abbreviation of "number"